Experimental Design for High-throughput Experiments

Susan Holmes

12 July 2019

Design of High Throughput Experiments

RA Fisher Father of Experimental Design
“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.”
(Fisher 1935; Presidential Address to the First Indian Statistical Congress, 1938. Sankhya 4, 14–17.)

Goals for this Lecture

A Resource Problem – the Art of “Good Enough”

Experimental design1 rationalizes the tradeoffs imposed by having finite resources.

Our measurement instruments have limited resolution and precision; often we don’t know these at the outset and have to collect preliminary data providing estimates of the precision we can hope for.

Sample sizes are limited for practical and economic reasons.

We may only be able to observe the phenomenon of interest indirectly rather than directly.

Our measurements may be overlaid with nuisance factors over which we have limited control. There is little point in prescribing unrealistic ideals – we need to make pragmatic choices that are feasible.

Different types of studies

Observational

In an observational study, we do not control the assignment of important factors.

Perhaps we have recruited 20 patients that have a disease and fulfill some broad inclusion criteria (such as age, comorbidities, mental capacity) and ask them to take a drug each day exactly at 6 am. After 3 months, we take an MRI scan and lots of other biomarkers to see whether and how the disease has changed or whether there were any other side effects. We have much less control: people may forget to take the pill or take it at the wrong time. Some may feel that the disease got worse and stop taking the drug. Some may feel that the disease got better and stop taking the drug. Some may lead a healthy life-style, others eat junk food. They have varying levels of disease to start with. And all of these factors may be correlated with each other in unpredictable ways.

Randomized Controlled Studies

In a well-designed experiment, we have control over all relevant variables:
- the (model) system under study, the environmental conditions,
- the experimental readout.

For instance, we could have a well-characterized cell line growing in laboratory conditions on defined media, temperature and atmosphere, we’ll administer a precise amount of a drug, and after 72h we measure the activity of a specific pathway reporter.

Meta-analyses

Sometimes we neither design the experiment or study ourselves nor collect the data; rather, we perform a retrospective analysis of data that already happen to exist: this could be a study of archived patient samples, or a re-analysis of an experiment previously done by someone else. When doing this we will need to use available information to weight the different results, or analyse them separately if they are too inhomogeneous.

Clinical trials

Clinical trials must take into account ethical constraints and psychological factors.

Ethics also enters into the optimal use of resources.

Variability, differences, noise and bias

Statisticians use the term error (different from its use in everyday language) for deviations of a measured value from the true value.

Two sources of error:
- those that will go away if we perform more replicates, randomize assignment to factors and take summary statistics – we call these noise;
- those that remain: bias, which needs to be diagnosed and modeled.

We devise models that parametrically incorporate these biases and estimate them along with the parameters of primary interest.

Examples

Error models: noise is in the eye of the beholder

The efficiency of most biochemical or physical processes involving DNA-polymers2 depends on their sequence content, for instance, occurrences of long homopolymer stretches, palindromes, overall or local GC content.

The sizes of these effects are not universal, but can also depend on factors like concentration, temperature, which enzyme is used, etc.

When looking at RNA-Seq data, should we treat GC content as noise or as bias?

Any particular measurement can be affected by multiple sources of noise and bias, and we can use formal models for this.

These can be very useful for experimental design and analysis. But there is no single way.

One person’s noise can be another’s bias.

Coin flip

For instance, we think that the outcome of tossing a coin is completely random.

If we meticulously registered the initial conditions of the coin flip and solved the mechanical equations, we could predict which side has a higher probability of coming up: noise becomes bias (Diaconis, Holmes, and Montgomery 2007).


We use probabilistic modelling as a method to bring some order into our ignorance – but let’s keep in mind that these models are a reflection of our subjective ignorance, not an objective property of the world.

Someone else’s model may be different, and sometimes for a good reason.

Biological versus technical replicates

Imagine we want to test whether a weight loss drug works. Which of the following designs would you recommend:

Surely the first option must be better since it has 20 replicates on a very precise instrument rather than only three on an older piece of equipment?

This example shows the difference between technical and biological replicates.

The number of replicates is less important than what types of variation are allowed to affect them.

The 20 replicates in the first design are wasted on re-measuring something that we already know with more than enough precision.

Whereas the far more important question –how does the effect generalize3 to different people– starts to be addressed with the second design, although in practice more people would be needed.

Analogous questions arise in high-throughput sequencing, e.g., when there is a choice to make between sequencing depth per sample, and number of samples.

For reliable variant calling with the sequencing technology used by the 1000 Genomes project, one needs about \(30\times\) coverage per genome. However, the average depth of the data produced was 5.1 for 1,092 individuals (1000 Genomes Project Consortium 2012). Why was that study design chosen?

The technical versus biological replicates terminology is helpful, but is often too coarse.

Error can creep in at many different levels: different labs, different operators within one lab, different technologies, different machines from the same technology, different variants of the protocol, different strains, litters, sexes, individual animals: we will use the notion of blocks.

If we know about some of the nuisance factors4 and have kept track of them, we can include them explicitly as bias terms in our models.

If we did not keep track, we can try to use latent factor models to identify them from the data5.

A lack of units: using replicates to assess noise.

Measurements in physics are usually reported in SI units, such as meters, kilograms and seconds. An ampere measured by a lab in Australia using one instrument has the same meaning as an ampere measured a year later by a lab in Canada using a different instrument, or by a space probe orbiting Venus. Measurements in biology are, unfortunately, rarely as well standardized.

Often, absolute values are not reported (these would require units), but only fold changes with regard to some local reference.

Even when absolute values exist (e.g., read counts in an RNA-Seq experiment) they usually do not translate into universal units such as molecules per cell or mole per milliliter.

We will use the sums of squares within replicates to give us a yardstick for the noise in the experiment.
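As a minimal illustration of this yardstick, here is a small R sketch with simulated replicate measurements (all values are invented): the pooled within-replicate sum of squares, divided by its degrees of freedom, estimates the noise variance.

# Simulated example: three replicate measurements for each of four samples
# (all numbers are invented for illustration).
set.seed(1)
dat = data.frame(
  sample = rep(c("A", "B", "C", "D"), each = 3),
  value  = rep(c(10, 12, 9, 11), each = 3) + rnorm(12, 0, 0.5))
# Pooled within-replicate sum of squares: a yardstick for the measurement noise.
ss_within = sum(tapply(dat$value, dat$sample, function(x) sum((x - mean(x))^2)))
df_within = nrow(dat) - length(unique(dat$sample))
ss_within / df_within    # pooled estimate of the noise variance (simulated true value: 0.5^2)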

Regular and catastrophic noise

Regular noise can be modelled by simple probability models such as independent normal distributions, Poissons, or mixtures such as Gamma-Poisson or Laplace.

We can use relatively straightforward methods to take noise into account in our data analyses.

The real world is different: measurements can be completely off scale (a sample swap, a contamination or a software bug), and they can go awry all at the same time (a whole microtiter plate went bad, affecting all data measured from it). Such events are hard to model or even correct for – our best chance to deal with them is data quality assessment, outlier detection and documented removal.

Keeping track: dailies

Don’t confuse objectivity with stupidity.

In the same way that a film director will view daily takes to correct potential lighting or shooting issues before they affect too much footage, it is a good idea not to wait until all the runs of an experiment have finished before looking at the data.

Intermediate data analyses and visualizations will reveal unexpected sources of variation in the data as they occur and enable you to adjust the protocol.

Much is known about sequential design of experiments (Mead 1990), but even in a more pragmatic setting it is important to be aware of sources of variation as they occur and to adjust for them.

Phylochip example

Basic principles in the design of experiments

Clever combinations and balancing: a motivating example

Weighing example

A chemical balance weighing analogy (Hotelling (1944) and Mood (1946)).

Experimental design aims to make the most of the available resources: capitalizing on cancellations and symmetry is an important aspect.

Hotelling devised an improved weighing scheme using experimental design.

Given a set of eight objects of unknown weights \(\theta=(\theta_1,\theta_2,\theta_3,\theta_4,\theta_5,\theta_6,\theta_7,\theta_8)\).

For our experiment, we create a true \(\theta\) randomly.

set.seed(866588)
theta=round((sample(8,8)*2+rnorm(8)),1)
theta
## [1]  7.3  8.3 10.6  3.4 16.2  2.3 13.6 14.2

Method 1: Naïve method, eight weighings

Suppose we use a chemical scale that weighs each \(\theta_i\) individually with errors distributed normally with a sd of 0.1.

We compute the vector of errors and their variance as follows:

X1to8=theta+rnorm(8,0,0.1)
X1to8
## [1]  7.378750  8.130963 10.669482  3.485248 16.345350  2.027833 13.594994
## [8] 14.019513
errors1=X1to8-theta
errors1
## [1]  0.078749930 -0.169036947  0.069481586  0.085248445  0.145350255
## [6] -0.272167372 -0.005005807 -0.180486591
var(errors1)
## [1] 0.02385608

Method 2: Hotelling’s method, eight weighings

library("survey")
h8 = hadamard(6)
coef8= 2*h8-1
coef8
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    1    1    1    1    1    1    1
## [2,]    1   -1    1   -1    1   -1    1   -1
## [3,]    1    1   -1   -1    1    1   -1   -1
## [4,]    1   -1   -1    1    1   -1   -1    1
## [5,]    1    1    1    1   -1   -1   -1   -1
## [6,]    1   -1    1   -1   -1    1   -1    1
## [7,]    1    1   -1   -1   -1   -1    1    1
## [8,]    1   -1   -1    1   -1    1    1   -1

We use the columns as the coefficients in a new weighing scheme.

The first column puts all eight objects on one side of the balance and weighs them together; call this \(Y[1]\).

The second column says we place objects (1,3,5,7) on one side of the balance and (2,4,6,8) on the other and record the difference; we call the result \(Y[2]\). The remaining columns are read in the same way.

Each of the eight weighings has a normal error with sd=0.1.

Y = theta  %*% coef8 + rnorm(8,0,0.1)
thetahat = Y %*% t(coef8)/8

Now, because in this case we know the true theta’s we can compute the errors and their variance:

errors2 = as.vector(thetahat - theta)
errors2
## [1] -0.010100265 -0.003761982  0.035312519  0.030667252  0.024131208
## [6] -0.016158934  0.046076215 -0.014852304
var(errors2)
## [1] 0.0006354833
var(errors1)/var(errors2)
## [1] 37.54006

Experiment by simulation

We see that this variance is much smaller than var(errors1) – in this run by a factor of about 37. Were we just lucky?

  1. Repeat Hotelling’s weighing experiment B=10,000 times with different theta and look at the sampling distributions of the variances of the errors in both schemes.
  2. Guess at the relationship between the two variances.
B = 10000
errors1 = matrix(0, nrow = B, ncol = 8)
errors2 = matrix(0, nrow = B, ncol = 8)
tcoef8 = t(coef8) / 8
for (i in 1:B) {
  theta = round((1:8) * 2 + rnorm(8), 1)
  X1to8o = theta + rnorm(8, 0, 0.1)
  errors1[i, ] = X1to8o - theta
  Y = coef8 %*% theta + rnorm(8, 0, 0.1)
  thetahat = tcoef8 %*% Y
  errors2[i, ] = thetahat - theta
}
apply(errors1,2,var)
## [1] 0.010141530 0.009737702 0.009982666 0.009927045 0.010062245 0.009979493
## [7] 0.010015036 0.010044065
apply(errors2,2,var)
## [1] 0.001271235 0.001228362 0.001263518 0.001267003 0.001238285 0.001251818
## [7] 0.001252250 0.001232865
apply(errors1,2,var)/apply(errors2,2,var)
## [1] 7.977700 7.927385 7.900690 7.835060 8.125951 7.971999 7.997635 8.146930

We say that the second scheme is more efficient than the first by a factor of 8 because the errors generated by the measurement have a variance which is 8 times lower. This example shows us that when several quantities are to be ascertained there is an opportunity to increase the accuracy and reduce the cost by combining measurements in one experiment and making comparisons between similar groups.

One factor at a time?

Ibn Sina (Avicenna), physician and scientist

In the Canon of Medicine (1020), Avicenna listed seven rules of experimental design, including the need for controls and replication, the danger of confounding, and the necessity of changing only one factor at a time.

This dogma was overthrown in the 20th century by R. A. Fisher.

Comparing two levels of one factor: healthy or diseased.

## ----chap10-r-confounding-1, fig.keep = 'high', fig.cap = "Comparison of a (hypothetical) biomarker between samples from disease and healthy states. If we are only given the information shown in the left panel, we might conclude that this biomarker performs well in detecting the disease. If, in addition, we are told that the data were acquired in two separate batches (e.g., different labs, different machines, different time points) as indicated in the panel on the right hand side, the conclusion will be different.", fig.width = 4.25, fig.height = 3, echo = FALSE----
library("ggplot2")
library("gridExtra")
library("ggbeeswarm")
## load("../data/designI.rda")
p0 = ggplot(bat6, aes(x = state, y = exprs0)) +
       geom_boxplot(alpha = 0.5, col="blue") + geom_beeswarm(size = 2, cex = 6) + # geom_point(size = 2) +
       ylab("biomarker level")
grid.arrange(p0, p0 + geom_beeswarm(aes(col = batch), size = 2, cex = 6),  # geom_point(aes(col = batch), size = 2),
  ncol = 2, widths = c(1.3, 2))

Comparison of a (hypothetical) biomarker between samples from disease and healthy states. If we are only given the information shown in the left panel, we might conclude that this biomarker performs well in detecting the disease. If, in addition, we are told that the data were acquired in two separate batches (e.g., different labs, different machines, different time points) as indicated in the panel on the right hand side, the conclusion will be different.

library("dplyr")
states=factor(c("healthy","tumor")[rep(c(1,2),6)])
times=c(1,1,1,1,2,2,2,2,3,3,3,3)
facte  = 0.5*times
exprst = facte+rep(c(0.5,1),6)+rnorm(12,0,0.3)
exprs0  = rep(c(1.5,2),6)+rnorm(12,0,0.1)
batch = factor(c("Batch1","Batch2")[rep(c(1,2),6)])
stateN=factor(c("healthy","tumor")[rep(c(1,2),60)], levels=c("healthy","tumor"))
exprsN  = rep(c(1.5,2),60)+rnorm(120,0,0.3)
dfN=data.frame(stateN,exprsN)
mN=summarise(group_by(dfN,stateN),med=median(exprsN))
pN= ggplot(dfN, aes(x=stateN, y=exprsN))

pN1= pN+geom_boxplot(alpha=0.5,col="blue") +
      geom_point(size=2,alpha=0.5) +
  geom_segment(data=mN,aes(y=med[1],yend=med[2]),x=1.5,xend=1.5, col="red",arrow = arrow(length = unit(0.5, "cm"), ends = "both", type = "closed"))
   

pNb=pN+geom_boxplot(alpha=0.5,col="blue")+
      geom_beeswarm(size=2,alpha=0.5)
batdef=data.frame(states,exprst,exprs0,times,batch)
ms0=summarize(group_by(batdef,states),y=median(exprs0))
p0= ggplot(batdef, aes(x=states, y=exprs0))
p0= p0+geom_boxplot(alpha=0.5,col="blue") +
  geom_point(size=2)
p0batch= ggplot(batdef, aes(x=states, y=exprs0,color=batch))
p0batch= p0batch+geom_boxplot(alpha=0.5,col="blue") +
  geom_point(size=2)
p0effect= p0 + geom_segment(data=ms0,aes(y=ms0[1,2],yend=ms0[2,2]),x=1.5,xend=1.5, col="red",arrow = arrow(length = unit(0.5, "cm"), ends = "both", type = "closed"))

We cannot draw a conclusion, because the disease state is confounded with the batch.

In an experiment with higher noise levels, the same sample sizes as in the previous study (2 × 6) are not enough; with the same error but a bigger sample (2 × 60), the difference becomes detectable.

## ----Design-effectsize, fig.keep = 'high', fig.cap = "The red arrow shows the effect size, as measured by the difference between the centers of the two groups. Here we locate the centers by the medians; sometimes the mean is used.", echo = FALSE, fig.width = 2, fig.height = 3----
p0 + geom_segment(data = ms0, aes(y = ms0$y[1], yend = ms0$y[2]),
    x = 1.5, xend = 1.5, col = "red", arrow = arrow(length = unit(0.5, "cm"),
    ends = "both", type = "closed"))

## ----Design-comparesamplesize, fig.keep = 'high', echo = FALSE, fig.width = 3.7, fig.height = 3----
p = ggplot(bat6, aes(x = state, y = exprst)) + geom_boxplot(alpha = 0.5, col = "blue") +
    ylim(c(0.5, 3)) + ylab("biomarker level")
p1 = p + geom_beeswarm(size = 2, cex = 6) # geom_point(size = 2)
p2 = p + geom_beeswarm(aes(col = time), size = 2, cex = 6) # geom_point(aes(col = time), size = 2)

mN = summarise(group_by(bat60, state), med = median(exprs))
pN = ggplot(bat60, aes(x = state, y = exprs)) + geom_boxplot(alpha = 0.5, col="blue") +
  ylab("biomarker level") + ylim(c(0.5,3)) + geom_beeswarm(size = 2, cex = 2)  +
  geom_segment(data = mN, aes(y = med[1],yend=med[2]), x = 1.5, xend = 1.5,
               col = "red", arrow = arrow(length = unit(0.5, "cm"), ends = "both", type = "closed"))
grid.arrange(p1, pN, ncol = 2, widths = c(1.6, 2.5))

On the left, the boxplot was created with samples of size 6. On the right the sample sizes are 60. The measurements have the same underlying error distribution in both cases.

With the same effect size and a larger sample size, we have the power to see the difference.

t.test(exprs~state,data=bat60)
## 
##  Welch Two Sample t-test
## 
## data:  exprs by state
## t = -8.6555, df = 117.96, p-value = 2.951e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5911068 -0.3709906
## sample estimates:
## mean in group healthy mean in group disease 
##              1.574949              2.055997

Summarize what we see in these boxplots.

Depends on:

Theory of multifactorial experiments.

Useful language

Decomposition of variability: analysis of variance.

Blocking: the case of paired experiments.

ZeaMays

Each pot in Darwin’s Zea Mays experiment is a block: only the factor of interest (the pollination method) should differ; all other factors should be kept equal within a block.

A balanced design is an experimental design in which all the different factor combinations have the same number of replicate observations. Such data are particularly easy to analyse because the effect of each factor is identifiable.
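A quick way to check balance is to lay out the design explicitly. Here is a small sketch (the factor names are hypothetical) of a balanced 2 × 3 factorial with four replicates per cell:

# A balanced 2 x 3 factorial design with 4 replicates per factor combination.
design = expand.grid(treatment = c("control", "drug"),
                     timepoint = c("24h", "48h", "72h"),
                     replicate = 1:4)
# Every factor combination occurs the same number of times:
table(design$treatment, design$timepoint)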

When there are (likely) nuisance factors, it is good to make sure they are balanced with the factors of interest. Sometimes this is inconvenient or impractical for logistic or economic reasons – but in such cases analysts are on thin ice and need to proceed with caution.

Comparing paired versus unpaired design

When comparing various possible designs, we do power simulations.

Here, we suppose the sample size is 15 in each group and the effect size is 0.2. We also need to make assumptions about the standard deviations of the measurements, here we suppose both groups have the same sd=0.25.

set.seed(45123)
n = 15; effect = 0.2
pots   = rnorm(n, 0, 1)
noiseh = rnorm(n, 0, 0.25)
noisea = rnorm(n, 0, 0.25)
hybrid = pots + effect + noiseh
autoz  = pots + noisea

Perform both a simple t-test and a paired t-test; which is more powerful in this case?

t.test(hybrid,autoz,paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  hybrid and autoz
## t = 0.47985, df = 28, p-value = 0.6351
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5669088  0.9137679
## sample estimates:
##  mean of x  mean of y 
## 0.24714393 0.07371438
t.test(hybrid,autoz,paired=TRUE)
## 
##  Paired t-test
## 
## data:  hybrid and autoz
## t = 2.1722, df = 14, p-value = 0.04751
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.002186349 0.344672751
## sample estimates:
## mean of the differences 
##               0.1734295

Maybe we were just lucky in our random simulation here.

Check which method is more powerful: run a parametric bootstrap experiment, generate data as above B=1000 times, and compute the proportion of rejections over these 1000 trials at a false positive rate of \(\alpha=0.05\).

B = 1000; n = 15; effect = 0.2
ppaired = rep(0, B)
pttest  = rep(0, B)
for (i in 1:B) {
  pots   = rnorm(n, 0, 1)
  noiseh = rnorm(n, 0, 0.25)
  noisea = rnorm(n, 0, 0.25)
  hybrid = pots + effect + noiseh
  autoz  = pots + noisea
  pttest[i]  = t.test(hybrid, autoz, paired = FALSE)$p.value
  ppaired[i] = t.test(hybrid, autoz, paired = TRUE)$p.value
}
sum(pttest<0.05)/B
## [1] 0.001
sum(ppaired<0.05)/B
## [1] 0.526

We can plot the p-values obtained using both methods:

library("ggplot2")
dtp=data.frame(pvalues=c(pttest,ppaired),
    experiment=factor(c(rep("notpaired",B),rep("paired",B))))
m = ggplot(dtp, aes(pvalues, fill=experiment))
m + geom_histogram(binwidth=0.01,alpha=0.3)

Exercises

powercomparison = function(effect = 0.2, n = 15, B = 1000,
                           noisesd = 0.25, potsd = 1) {
  ppaired = rep(0, B)
  pttest  = rep(0, B)
  for (i in 1:B) {
    pots   = rnorm(n, 0, potsd)
    noiseh = rnorm(n, 0, noisesd)
    noisea = rnorm(n, 0, noisesd)
    hybrid = pots + effect + noiseh
    autoz  = pots + noisea
    pttest[i]  = t.test(hybrid, autoz, paired = FALSE)$p.value
    ppaired[i] = t.test(hybrid, autoz, paired = TRUE)$p.value
  }
  PowerPaired   = sum(ppaired < 0.05) / B
  PowerUnpaired = sum(pttest < 0.05) / B
  cbind(PowerPaired, PowerUnpaired)
}

Here are a few values showing that when the pots sd is smaller than the noise sd, pairing hardly makes a difference. If the pots variability is larger than that of the measurement noise then pairing makes a big difference.

powercomparison(potsd=0.5,noisesd=0.25)
##      PowerPaired PowerUnpaired
## [1,]       0.495         0.028
powercomparison(potsd=0.25,noisesd=0.5)
##      PowerPaired PowerUnpaired
## [1,]       0.168         0.126
powercomparison(potsd=0.25,noisesd=0.1)
##      PowerPaired PowerUnpaired
## [1,]       0.999         0.495
powercomparison(potsd=0.1,noisesd=0.25)
##      PowerPaired PowerUnpaired
## [1,]       0.524         0.482

For 100 plants of each type and the two sd’s being 0.5, the power of the paired test is about 80%.

powercomparison(potsd=0.5,noisesd=0.5,n=100)
##      PowerPaired PowerUnpaired
## [1,]       0.788         0.497

Paired designs take into account a natural pairing of the observations – for instance, twin studies, or studies of patients before and after a treatment. What can be done when such a natural pairing is not available?

We can try to create pairs of subjects that are as similar as possible by matching age, gender, background health, etc. One member of each pair is treated, the other serves as a control.

“Block what you can, randomize what you cannot”

(George Box, 1978)

Special designs: paired, randomized complete block, balanced incomplete block, Latin squares.

Often we don’t know which nuisance factors will be important, or we cannot plan for them ahead of time.

In such cases, randomization is a practical strategy: at least in the limit of large enough sample size, the effect of any nuisance factor should average out.
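As a minimal sketch (the sample names are hypothetical), randomizing treatment assignment is a one-liner in R:

# Randomly assign 12 samples to two groups of equal size,
# so that unknown nuisance factors tend to average out across groups.
set.seed(123)
samples    = paste0("sample", 1:12)
assignment = sample(rep(c("control", "treatment"), each = 6))
data.frame(samples, assignment)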

Randomized Complete Block Design (RCB)

Complete randomized

Balanced Incomplete Block Design (BIBD)

balanced incomplete

Complete factorial Latin Squares

sudoku anyone?

Randomization decreases bias.

Randomization helps inference.

Random does not mean haphazard.

Controls, positive and negative: why?

We often need to remove variation due to unknown factors, or decompose variability according to its different sources; this is classically done using analysis of variance and mixed models that can accommodate random factors such as subject effects and fixed factors such as batch effects.

Usually these decompositions require at least 3 replicate measurements in each cell.
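Here is a hedged sketch of such a decomposition on simulated data (all effect sizes are invented): a two-way analysis of variance with the batch as a blocking factor and three replicates in each cell, in its simplest fixed-effects form.

# Simulate a treatment effect plus a batch (nuisance) effect, 3 replicates per cell.
set.seed(7)
d = expand.grid(treatment = c("control", "drug"),
                batch     = c("B1", "B2", "B3"),
                replicate = 1:3)
batch_effect = rnorm(3, 0, 0.5)
d$y = 10 + 0.8 * (d$treatment == "drug") +   # treatment effect of 0.8
      batch_effect[as.integer(d$batch)] +    # batch (nuisance) effect
      rnorm(nrow(d), 0, 0.3)                 # measurement noise
# Decompose the variability into treatment, batch and residual components.
summary(aov(y ~ treatment + batch, data = d))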

Removal of effects from unknown sources can only be done through the use of negative controls6.

Calibration of the effect size in an experiment also requires the use of positive controls: spike-ins (for instance, the External RNA Control Consortium controls used in Risso et al. (2014)), in which a known quantity or a known expression level aids the calibration; they are a standard part of many experimental protocols.

Validation in independent data / independent lines of argument

If it is too good to be true, it usually is? An anecdote7

How many replicates do I need?

We use preliminary data and simulation experiments to calculate, for example, how many nucleotides are necessary to achieve an 80% true positive rate when we know the alternative.

Now, recall the discussion of experiments versus studies.

For the cell line experiment, we might get the correct result already from one replicate; usually we’ll do two or three to be sure.

On the other hand, for the study, our intuition tells us that there is so much uncontrolled variability that 20 is likely far too few, and we may need 200 or 2,000 patients for a reliable result. The number of replicates needed is highly context specific: it depends on the amount of uncontrolled8 variability and on the effect size. A pragmatic approach is to look at previous successful (or unsuccessful) experiments or studies that did something comparable and to use simulations, subsampling or bootstrapping to get an estimate of the proposed study’s power. Here are more details about how to go about this in practice.

Power depends on sample sizes, effect sizes and variability.

Effective sample size for dependent data.

Example: to have a power of 80% to distinguish two groups (\(\alpha=0.05\)), one needs samples of size 64 in each group (128 measurements).

For two time points that have a correlation of \(\rho=0.6\), one needs samples of size \(N=50\) in each group, measured twice (i.e., 200 measurements).
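These numbers can be checked approximately with base R’s power.t.test, assuming (for this illustration) a standardized effect size of 0.5, i.e., a difference of half a standard deviation; the correlated-time-point case uses the fact that averaging two measurements with correlation \(\rho\) reduces the variance by a factor \((1+\rho)/2\).

# Sample size per group for 80% power at alpha = 0.05,
# assuming a standardized effect size of 0.5 (an assumption for this sketch).
power.t.test(delta = 0.5, sd = 1, power = 0.8, sig.level = 0.05)$n   # about 64

# Averaging two time points with correlation rho = 0.6 shrinks the variance
# by (1 + rho)/2 = 0.8, so fewer subjects per group are needed.
rho = 0.6
power.t.test(delta = 0.5, sd = sqrt((1 + rho) / 2), power = 0.8)$n   # about 51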

Perturbation Design

Incomplete blocks, ragged arrays, missing data.

Data Imputation...

Mean-Variance Relationships and Transformations

Previously we saw examples of data transformations that compress or stretch the space of quantitative measurements in such a way that the measurements’ variance is more similar throughout. Thus the variance is no longer highly dependent on the mean value9.

The mean-variance relationship of a measurement technology can in principle be any function, but in many cases the following prototypic relationships hold:

  1. constant: the variance is independent of the mean.

  2. Poisson: the variance is proportional to the mean.

  3. quadratic: the standard deviation is proportional to the mean, therefore the variance grows quadratically.

Give examples for biological assays (or measurement technologies) whose data show these types of relationships.

Real data can also be affected by a combination of these. For instance, with DNA microarrays, the fluorescence intensities are subject to a combination of background noise that is largely independent of the signal, and multiplicative noise whose standard deviation is proportional to the signal (Rocke and Durbin 2001). Therefore, for bright spots the multiplicative noise dominates, whereas for faint ones, the background noise dominates.
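A small simulation (all parameter values are invented) illustrates such a two-component error model: additive background noise dominates at low intensities, multiplicative noise at high intensities.

# Additive-plus-multiplicative error model in the spirit of Rocke and Durbin (2001):
# y = mu * exp(eta) + epsilon, with multiplicative noise exp(eta) and additive noise epsilon.
set.seed(11)
mu = 10^runif(2000, 0, 4)                           # true intensities over four decades
y  = mu * exp(rnorm(2000, 0, 0.1)) + rnorm(2000, 0, 20)
# For faint features the error sd is roughly constant (about 20);
# for bright features it grows roughly proportionally to the mean (about 10%).
plot(mu, y - mu, log = "x", cex = 0.3,
     xlab = "true intensity", ylab = "measurement error")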

Load up the raw data from a microarray experiment with replicates and verify the above statement.

We recall that for data with a mean-variance relationship \(v(\mu)\), the variance-stabilizing transformation \(g:\mathbb{R}\to\mathbb{R}\) fulfills the condition \[g'(x)=\frac{1}{\sqrt{v(x)}}\]

a) What are the variance-stabilizing transformations associated with the three prototypic mean-variance relationships above?
b) What is the variance-stabilizing transformation appropriate for \(v(\mu) = v_0 + c\,\mu^2\), where \(v_0>0\) is a positive constant?
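For reference, here is a sketch of the standard answers, obtained by integrating \(g'(x)=1/\sqrt{v(x)}\) (additive and multiplicative constants are omitted):

\[
v(x)=c \;\Rightarrow\; g(x)\propto x, \qquad
v(x)=c\,x \;\Rightarrow\; g(x)\propto \sqrt{x}, \qquad
v(x)=c\,x^2 \;\Rightarrow\; g(x)\propto \log x,
\]
\[
v(x)=v_0+c\,x^2 \;\Rightarrow\; g(x)\propto \operatorname{arsinh}\!\left(x\sqrt{c/v_0}\right).
\]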

Data quality assessment and quality control

Data quality assessment (QA) and quality control (QC, i.e., the removal of insufficiently good data) are essential steps of any data analysis. Being critical of data quality (both of raw and derived data) should pervade all phases of analysis, from data import through model fitting and hypothesis testing to interpretation. The most useful tools for QA are usually visualisations (we saw examples earlier in the case of RNA-Seq data). If they highlight anomalies, it is necessary to decide which remedial steps to take (for instance, to exclude certain parts of the data) and, probably, to redo the analysis, either with the same or a refined method. These decisions are necessarily subjective and context-dependent.

It’s helpful to be clear on what one means by quality, as the word can have many meanings. The most pertinent for us is fitness for purpose10. Back to the example of RNA-Seq data, the purpose of the experiment is, in many cases, the detection of differentially expressed genes between different biological conditions. The aim of QA/QC will then be the identification of data points or groups of data points (e.g., all data from one sample, all data from one gene) that apparently suffered from an anomaly that makes them detrimental to this purpose. Ordination plots and heatmaps are useful to identify outliers: for instance, samples that are not behaving as expected, because of a sample swap or misannotation; or genes that were not measured properly.

We saw an example of this in an earlier section.

Longitudinal Data

Longitudinal data11 have time as a covariate. The first question is whether we are looking at a handful of time points – say, the response of a cell line measured 48h, 72h and 84h after exposure to a drug – or a long and densely sampled time series – say, patch clamp data in electrophysiology or a movie from live-cell microscopy.

In the first case, time is usually best thought of as just another experimental factor, in the same way as we consider the concentration or the choice of drug. One analysis strategy could be to first identify the “best”, or biologically most indicative, time point, and then focus on that. Or we can ask whether there is any effect at all, regardless of the time. We then just need to make sure that we account for the dependencies between the measurements over time, and for effective sample sizes – analogous to what we discussed above. When designing the experiment, we’ll also try to sample more densely at those times when we expect most to happen.

In the second case, time series, we’ll often want to fit dynamical models to the data. We have many choices.

It’s outside the scope of this book to go into details, but it’s good to keep in mind that we have a huge number of choices12, and to approach our data with a problem looking for a method rather than a method looking for a problem mindset. Many of these methods originated in physics, econometrics or engineering, so it’s worthwhile to scan the literature in these fields.

Don’t Pretend You Are Dumb

There is some attraction to seemingly “unbiased” approaches that analyse the data at hand without any reference to what is already known. Such tendencies are reinforced by the fact that statistical methods have often been developed to be generic, for instance, working on a general matrix without specific reference to what the rows and columns might signify.

Generic approaches are a good way to get started, and for analyses that are highly powered and straightforward, such an approach might work out. But often, it is wasteful. Recall the example of an RNA-Seq experiment for differential expression. As we saw in earlier chapters, we could perform a hypothesis test for differential expression for each annotated gene, regardless of its read counts, and then run a multiple testing method that treats all tests as exchangeable. But this is inefficient – we can improve our detection power by filtering out, or downweighting, the tests with lower signal-to-noise ratio (Ignatiadis et al. 2016).

Other examples include:

Sharpen Your Tools: Reproducible Research

Analysis projects often begin with a simple script, perhaps to try out a few initial ideas or explore the quality of the pilot data. Then more ideas are added, more data come in, other datasets are integrated, more people become involved. Eventually the paper needs to be written, figures be redone ’properly’, and the analysis be saved for the scientific record and to document its integrity. Here are a few tools that can help with such a process.

Use an integrated development environment. RStudio is a great choice; there are also other platforms such as Emacs or Eclipse.

Use literate programming tools such as Rmarkdown or Jupyter. This is more readable (for yourself and for others) than burying explanations and usage instructions in comments in the source code or in separate README files, in addition you can directly embed figures and tables in these documents. Such documents are often good starting points for the supplementary material of your paper.

Anticipate re-engineering of the data formats and the software. The first version of how you choose to represent the data and structure the analysis workflows will rarely be the best. Don’t be afraid13 to make a clean cut and redesign them as soon as you notice that you are doing a lot of awkward data manipulations or repetitive steps. This is time well-invested. Sometimes it also helps to unearth bugs.

Reuse existing tools. Don’t reinvent the wheel and rather spend your time on things that are actually new. Before implementing a “heuristic” or a temporary “hack” that analyses your data in a certain way, spend a couple of minutes researching to see if something like this hasn’t been done before. More often than not, it has, and there is a clean, scalable, and already tested solution.

Use version control, such as git. This takes some time to learn, but in the long run it will be infinitely better than all your self-grown attempts at managing evolving code with version numbers, switches and the like. Moreover, this is also the sanest option for collaborative work on code, and it provides an extra backup of your codebase, especially if the server is distinct from your workplace machine.

Use functions rather than copy-pasting (or repeatedly re-running) stretches of code.

Use the R package system. Soon you’ll note recurring function or variable definitions that you want to share between your individual scripts. It is fine to use the R function source to manage them initially, but it is never too early to move them into your own package – at the latest when you find yourself starting to write README files or long emails explaining to others how to use some script or another. Assembling existing code into an R package is not hard by any means, and it offers you many goodies including standardized and convenient ways to provide documentation, to show code usage examples, to test the correct functioning of your code, and to version it. Quite likely you’ll soon appreciate the benefit of using namespaces.

Centralize the location of the raw data files and streamline the derivation of intermediate data. Store the input data on a centralized file server that is professionally backed up. Mark the files as read-only. Have a clear and linear workflow for computing the derived data (e.g., normalised, summarised, transformed) from the raw files, and store these in a separate directory. Anticipate that this workflow will need to be re-run several times, and version it. Use file synchronization tools14 to mirror these files on your personal computer.

Integration. When developing downstream analysis ideas that bring together several different data types, you don’t want to redo the conversion from data-type-specific formats to the representations that machine learning or generic statistical methods use on an ad hoc basis each time. Have a recipe script that assembles the different ingredients and cooks them up as an easily consumable15 matrix, data frame or Bioconductor object.

Keep a hyperlinked webpage with an index of all analyses. This is helpful for collaborators (especially if the page and the analysis can be accessed via a web browser) and is also a good starting point for the methods part of your paper. Structure it in chronological or logical order, or a combination of both.

Data representation

Combining all the data so it is ready for analysis or visualisation often involves a lot of shuffling around of the data, until they are in the right shape and format for an analytical algorithm or a graphics routine.

Errors can occur – labels get lost, information gets lost. Be safe: redundancy is good.

Wide vs long table format

Recall the Hiiragi data (for space reasons we print only the first five columns):

    ##             1 E3.25  2 E3.25  3 E3.25  4 E3.25  5 E3.25
    ## 1420085_at 3.027715 9.293016 2.940142 9.715243 8.924228
    ## 1418863_at 4.843137 5.530016 4.418059 5.982314 4.923580
    ## 1425463_at 5.500618 6.160900 4.584961 4.753439 4.629728
    ## 1416967_at 1.731217 9.697038 4.161240 9.540123 8.705340

This dataframe has several columns of data, one for each sample (annotated by the column names). Its rows correspond to the four probes, annotated by the row names. This is an example of a data table in wide format.

    ##   variable    value
    ## 1  1 E3.25 3.027715
    ## 2  1 E3.25 4.843137
    ## 3  1 E3.25 5.500618
    ## 4  1 E3.25 1.731217
    ## 5  2 E3.25 9.293016
    ## 6  2 E3.25 5.530016

In the resulting long-format dataframe, each row corresponds to exactly one measured value, stored in the column value; the column variable stores the associated covariate, here the sample identifier.

Compare this to the dataframe above.

Now suppose we want to store somewhere not only the probe identifiers but also the associated gene symbols. We could stick them as an additional column into the wide format dataframe, and perhaps also throw in the genes’ ENSEMBL identifier for good measure. But now we immediately see the problem: the dataframe then has some columns that represent different samples, and others that refer to information for all samples (the gene symbol and identifier), and we somehow have to “know” this when interpreting the dataframe. This is what Hadley Wickham calls untidy data16. In contrast, in the tidy (long-format) dataframe, we can add these columns and still know that each row forms exactly one observation, and all information associated with that observation is in the same row.
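As a minimal sketch of going from wide to long format (using a small toy table standing in for the Hiiragi data, with values taken from the printout above), tidyr’s pivot_longer can do the reshaping:

library("tidyr")
library("tibble")
# A toy wide-format table: one row per probe, one column per sample.
wide = tibble(probe     = c("1420085_at", "1418863_at"),
              `1 E3.25` = c(3.03, 4.84),
              `2 E3.25` = c(9.29, 5.53))
# Pivot into long ("tidy") format: one measured value per row.
long = pivot_longer(wide, cols = -probe, names_to = "sample", values_to = "value")
long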

Ragged arrays

In tidy data,

  1. each variable forms a column,

  2. each observation forms a row,

  3. each type of observational unit forms a table.

A potential drawback is efficiency: even though there are only 4 probe–gene symbol relationships, we would now be storing them 404 times in the rows of the long dataframe. Moreover, there is no standardisation: we might call the new column one thing, while the next person calls it something completely different, and when we find a dataframe made by someone else that contains a column of gene symbols, we can hope, but have no guarantee, that these are valid gene symbols. Addressing such issues is behind the object-oriented design of the specialized data structures in Bioconductor.

Matrices versus dataframes

For a specific data type, a long-format dataframe may not be the most efficient way of storing data, and it cannot easily transport rich metadata (i.e., data about the data).17 For instance, our example dataset is stored as an object of Bioconductor’s ExpressionSet class, which has multiple components, most importantly the matrix of expression values with 45101 rows and 101 columns. The matrix elements are the gene expression measurements, and the feature and sample associated with each measurement are implied by its position (row, column) in the matrix; in contrast, in the long table format, the feature and sample identifiers need to be stored explicitly with each measurement. In addition, the object has further components, including the dataframes returned by fData and pData, which provide metadata about the microarray features and phenotypic information about the samples.
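A minimal toy sketch (invented values, assuming Biobase’s ExpressionSet class is the container in question) shows how the measurement matrix and its feature and sample metadata travel together in one object:

library("Biobase")
# Toy expression matrix: 3 features x 2 samples (values invented).
m = matrix(rnorm(6), nrow = 3,
           dimnames = list(c("probe1", "probe2", "probe3"), c("s1", "s2")))
es = ExpressionSet(assayData = m)
pData(es)$condition = c("healthy", "tumor")          # per-sample metadata
fData(es)$symbol    = c("GeneA", "GeneB", "GeneC")   # per-feature metadata
exprs(es)[1:2, ]   # the measurements
pData(es)          # the sample annotations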

Analysis Workflow Design

Don’t immediately rush into downstream (“biological”) analysis, or run black-box algorithms. First make sure you understand the qualities of the data. Here are some questions and diagnostic plots:

Many examples of expression experiments with unfortunate designs exist. Lin et al. (2014) had an experimental design where the human and mouse samples were treated differently (different lanes, time to collect samples, etc.), so that the confounding18 of the batch factor with the species factor precluded verifying a difference between species (Gilad and Mizrahi-Man 2015).

Leaky pipelines and statistical sufficiency

Data analysis pipelines in high-throughput biology often work as ‘funnels’ that successively summarise and compress the data. In high-throughput sequencing, we may start with individual sequencing reads, then align them to a reference, then only count the aligned reads for each position, summarise positions to genes (or other kinds of regions), then “normalize” these numbers by library size to make them comparable across libraries, etc. At each step, we lose some information, and it is important to make sure we still have enough information for the task at hand19. The problem is particularly acute if we use a data pipeline built from a series of separate components without enough care being taken to ensure that all the information necessary for later steps is conserved.

Statisticians have a concept for whether certain summaries enable the reconstruction of all the relevant information in the data: sufficiency. In a Bernoulli random experiment with a known number of trials, \(n\), the number of successes is a sufficient statistic for estimating the probability of success \(p\).
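This can be seen from the factorization of the likelihood: for observed outcomes \(x_1,\dots,x_n \in \{0,1\}\) with \(k=\sum_i x_i\) successes,
\[
L(p) \;=\; \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} \;=\; p^{k}\,(1-p)^{\,n-k},
\]
which depends on the data only through \(k\).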

In a four-state Markov chain on (A, C, G, T), such as the one we saw in an earlier chapter, what are the sufficient statistics for the estimation of the transition probabilities?

Iterative approaches akin to what we saw when we used the EM algorithm can sometimes help avoid information loss. For instance, when analyzing mass spectrometry data, a first run guesses at peaks individually for every sample. After this preliminary peak-spotting, another iteration allows us to borrow strength from the other samples to spot peaks that may previously have been labeled as noise.

Parametric bootstrap for power simulations

Power calculations need to be done before an experiment is run. However, when calculating power one needs estimates of quantities such as the effect size20 (the difference in means between two treatments) that will only be known after the experiment. The best way to work around this is to simulate with different possible values of the unknown parameters to get an idea of the actual sample sizes needed.

Generate data with a fitted model: we need to find a parametric family \(F_\theta\) that is not too terrible at modeling our data. Then we estimate \(\theta\) from the data and generate whole sets of simulated data from the distribution \(F_{\hat{\theta}}\), while varying effect sizes, sample sizes, levels of replication, etc.
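A minimal sketch, assuming (hypothetically) that pilot data are reasonably described by a normal family \(F_\theta = N(\mu,\sigma^2)\): we estimate \(\sigma\) from the pilot data and then simulate two-group experiments across a range of sample sizes.

# Parametric bootstrap power simulation (the pilot data here are invented).
set.seed(2019)
pilot     = rnorm(10, mean = 5, sd = 0.8)   # stand-in for real pilot measurements
sigma_hat = sd(pilot)                       # estimate of the nuisance part of theta

power_sim = function(n, effect, sigma, B = 1000, alpha = 0.05) {
  pvals = replicate(B, {
    x = rnorm(n, 0, sigma)            # control group
    y = rnorm(n, effect, sigma)       # treated group, shifted by 'effect'
    t.test(x, y)$p.value
  })
  mean(pvals < alpha)                 # estimated power
}

# Vary the sample size (and, if desired, the effect size) to see what is achievable.
sapply(c(10, 20, 40), function(n) power_sim(n, effect = 0.5, sigma = sigma_hat))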

Summary

Issues should be carefully considered before doing an experiment.

There are two types of variation in an experiment: one is of interest, the other is unwanted.

We usually cannot rid ourselves of all the unwanted variation, but we saw how we can use balanced randomized designs and data transformations to improve quality control.

We showed how to compute the power of our study by simulation experiments; these illustrated the necessity of carefully choosing time points in longitudinal studies, especially when perturbations are part of the design.

A reproducible workflow enables us to be transparent in our choices and to evaluate the overall robustness of the results.

Technology specific issues

RNA-Seq type experiments, metagenomics, ChIP-Seq

Read coverage depth

The choice of depth depends on the context and goal of the experiment. Sims et al. (2014) make some very important points about choices that have to be made before the experiment is carried out. With only a finite number of runs and samples possible because of resource limitations, one has to …

Different problems require different depths

Mass spec, proteomics

See careful study by Oberg and Vitek (2009)

References and Further Reading

This lecture presented merely a pragmatic introduction to design; there are many book-length treatments of the subject that offer detailed advice on setting up experiments:

Wu and Hamada (2011); Box, Hunter, and Hunter (1978); Glass (2007)

1000 Genomes Project Consortium. 2012. “An Integrated Map of Genetic Variation from 1,092 Human Genomes.” Nature 491 (7422): 56–65.

Auer, Paul L, and RW Doerge. 2010. “Statistical Design and Analysis of RNA Sequencing Data.” Genetics 185 (2). Genetics Soc America: 405–16.

Bacher, Rhonda, and Christina Kendziorski. 2016. “Design and Computational Analysis of Single-Cell RNA-Sequencing Experiments.” Genome Biology 17 (1): 1.

Box, George EP, William G Hunter, and J Stuart Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley & Sons.

Diaconis, Persi, Susan Holmes, and Richard Montgomery. 2007. “Dynamical Bias in the Coin Toss.” SIAM Review 49 (2). SIAM: 211–35.

Fisher, Ronald Aylmer. 1935. The Design of Experiments. Oliver & Boyd.

Gilad, Yoav, and Orna Mizrahi-Man. 2015. “A Reanalysis of Mouse ENCODE Comparative Gene Expression Data.” F1000Research 4. Faculty of 1000 Ltd.

Glass, David J. 2007. Experimental Design for Biologists. Cold Spring Harbor Laboratory Press.

Henderson, Fergus. 2017. “Software Engineering at Google.” ArXiv E-Prints.

Hotelling, Harold. 1944. “Some Improvements in Weighing and Other Experimental Techniques.” The Annals of Mathematical Statistics 15 (3). JSTOR: 297–306.

Ignatiadis, Nikolaos, Bernd Klaus, Judith Zaugg, and Wolfgang Huber. 2016. “Data-Driven Hypothesis Weighting Increases Detection Power in Genome-Scale Multiple Testing.” Nature Methods.

Jacob, Laurent, Guillaume Obozinski, and Jean-Philippe Vert. 2009. “Group Lasso with Overlap and Graph Lasso.” In Proceedings of the 26th Annual International Conference on Machine Learning, 433–40. ACM.

Koncan, Raffaella, Aránzazu Valverde, María-Isabel Morosini, María García-Castillo, Rafael Cantón, Giuseppe Cornaglia, Fernando Baquero, and Rosa del Campo. 2007. “Learning from Mistakes: Taq Polymerase Contaminated with \(\beta\)-Lactamase Sequences Results in False Emergence of Streptococcus Pneumoniae Containing Tem.” Journal of Antimicrobial Chemotherapy 60 (3). Br Soc Antimicrob Chemo: 702–3.

Leek, Jeffrey T, Robert B Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead, W Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A Irizarry. 2010. “Tackling the Widespread and Critical Impact of Batch Effects in High-Throughput Data.” Nature Reviews Genetics 11 (10): 733–39.

Leek, Jeffrey T., and John D. Storey. 2007. “Capturing heterogeneity in gene expression studies by surrogate variable analysis.” PLoS Genetics 3 (9): 1724–35.

Lin, Shin, Yiing Lin, Joseph R Nery, Mark A Urich, Alessandra Breschi, Carrie A Davis, Alexander Dobin, et al. 2014. “Comparison of the Transcriptional Landscapes Between Human and Mouse Tissues.” PNAS 111 (48). National Acad Sciences: 17224–9.

Mead, Roger. 1990. The Design of Experiments: Statistical Principles for Practical Applications. Cambridge University Press.

Mood, Alexander M. 1946. “On Hotelling’s Weighing Problem.” The Annals of Mathematical Statistics. JSTOR, 432–46.

Oberg, Ann L, and Olga Vitek. 2009. “Statistical Design of Quantitative Mass Spectrometry-Based Proteomic Experiments.” Journal of Proteome Research 8 (5). ACS Publications: 2144–56.

Risso, Davide, John Ngai, Terence P Speed, and Sandrine Dudoit. 2014. “Normalization of RNA-Seq Data Using Factor Analysis of Control Genes or Samples.” Nature Biotechnology 32 (9). Nature Research: 896–902.

Rocke, David M, and Blythe Durbin. 2001. “A Model for Measurement Error for Gene Expression Arrays.” Journal of Computational Biology 8 (6): 557–69.

Sims, David, Ian Sudbery, Nicholas E Ilott, Andreas Heger, and Chris P Ponting. 2014. “Sequencing Depth and Coverage: Key Considerations in Genomic Analyses.” Nature Reviews Genetics 15 (2). Nature Publishing Group: 121–32.

Stegle, O., L. Parts, R. Durbin, and J. Winn. 2010. “A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies.” PLoS Computational Biology 6 (5): e1000770.

Wiel, Mark A, Tonje G Lien, Wina Verlaat, Wessel N Wieringen, and Saskia M Wilting. 2016. “Better Prediction by Use of Co-Data: Adaptive Group-Regularized Ridge Regression.” Statistics in Medicine 35 (3): 368–81.

Wu, CF Jeff, and Michael S Hamada. 2011. Experiments: Planning, Analysis, and Optimization. Vol. 552. John Wiley & Sons.


  1. “Generally speaking, a well-designed experiment is one that is sufficiently powered and one in which technical artifacts and biological features that may systematically affect measurements are randomized, balanced, or controlled in some other way in order to minimize opportunities for multiple explanations for the effect(s) under study.” (Bacher and Kendziorski 2016)

  2. For instance, PCR, or being pulled through a nanopore

  3. Inference or generalizations can only be made to a wider population if we have a representative, randomized sample of that population in our study. In the first case if a weight loss occurs, one could only infer about that person at that time.

  4. Sometimes they’re also referred to as batch effects (Leek et al. 2010).

  5. This is particularly fruitful for high-dimensional data (Leek and Storey 2007; Stegle et al. 2010).

  6. Measurements in a natural state without any of the changes tested in the experiment. For example, a PCR reaction without any DNA, or a sample obtained by simply swabbing the lab bench or running water through the PCR. See Koncan et al. (2007) for an example where the authors found contaminants in Taq polymerase.

  8. or, as in the case of drugs that we want to give to actual people, inherently uncontrollable

  9. The variance of, say, different genes can be different for many reasons, of which the mean is only one – so the goal of a variance-stabilizing transformation is not to make all variances the same, but to remove not more, and not less, than the influence of the mean.

  10. http://en.wikipedia.org/wiki/Quality_%28business%29

  11. A related but different concept is survival data, in which time is the outcome variable. A characteristic feature of these data is censoring, i.e., for many samples the outcome event is not actually observed; we only know that it must be later than some time point. A prototypic application is clinical studies, where we observe the time from a medical procedure until an event such as relapse of the disease.

  12. One starting point is the CRAN task view https://cran.r-project.org/web/views/TimeSeries.html.

  13. The professionals do it, too: “Most software at Google gets rewritten every few years.” Henderson (2017)

  14. Commercial options include Dropbox and Google Drive.

  15. In computer science, the term data warehouse is sometimes used for such a concept.

  16. There are many different ways for data to be untidy.

  17. In other words, simple tables or dataframes cannot offer all the nice features provided by object-oriented approaches, such as encapsulation, abstraction of interface from implementation, polymorphism, inheritance and reflection.

  18. Confounding can be removed by always randomizing the assignment of subjects to different groups.

  19. For instance, for the RNA-Seq differential expression analysis that we saw earlier, we needed the actual read counts, not “normalized” versions; for some analyses, gene-level summaries might suffice, for others, we’ll want to look at the exon or isoform level.

  20. Effect size: the amplitude of the unknown difference between two groups is called the effect size.